-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CPU LLM Integration Test #373
Conversation
Looks like we should maybe paste together the sharktank and shortfin ci ymls to make a new environment to do integration testing on. @ScottTodd need your opinion on this. Maybe also make it nightly because it is pretty slow running. But I kindda want to make sure nobody pushes something to break shortfin serving & maybe we could make it fast enough via caching the HF downloads and tokenizer conversions. |
Reached out to @ScottTodd this morning. Am currently working on setting up a workflow file |
Sounds good! Lmk when you set that up - - I also have tests that need to go under the same bucket. |
@ScottTodd Thanks for the feedback! Working on requested changes and setting up CI workflow |
Small patch reflecting some recent changes in `sf.Program` and `sf.ProgramFunction`. Was originally included as part of this PR, which adds an integration test to shortfin llm serving: #373 But, parsing it out, since that may take a little more time to make adjustments/add workflow file. Without it, you get the following error when trying to launch the server: ```text [2024-10-30 11:59:09.939] [info] [manager.py:40] System manager command processor stopped [2024-10-30 11:59:09.991] [error] [on.py:121] Traceback (most recent call last): File "/home/amd/stephen/repos/forks/SHARK-Platform/.venv/lib/python3.12/site-packages/starlette/routing.py", line 693, in lifespan async with self.lifespan_context(app) as maybe_state: File "/home/amd/.pyenv/versions/3.12.5/lib/python3.12/contextlib.py", line 210, in __aenter__ return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/home/amd/stephen/repos/forks/SHARK-Platform/.venv/lib/python3.12/site-packages/shortfin_apps/llm/server.py", line 42, in lifespan service.start() File "/home/amd/stephen/repos/forks/SHARK-Platform/.venv/lib/python3.12/site-packages/shortfin_apps/llm/components/service.py", line 69, in start self.inference_program = sf.Program( ^^^^^^^^^^^ TypeError: __new__(): incompatible function arguments. The following argument types are supported: 1. __new__(cls: object, modules: collections.abc.Sequence[_shortfin_default.lib.local.ProgramModule], *, devices: collections.abc.Sequence[_shortfin_default.lib.local.Device], trace_execution: bool = False, isolation: _shortfin_default.lib.local.ProgramIsolation = ProgramIsolation.PER_FIBER) -> _shortfin_default.lib.local.Program Invoked with types: nanobind.nb_type_0, kwargs = { modules: list, fiber: _shortfin_default.lib.local.Fiber, trace_execution: bool } [2024-10-30 11:59:09.991] [error] [on.py:59] Application startup failed. Exiting. ``` With it, you're able to start server, send requests, and receive responses.
Pushing in pieces to keep better track of feedback |
The CI - shortfin - ASan workflow is failing on the integration test, With the reason being a leak in the sentencepiece module: ==5818==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 24 byte(s) in 1 object(s) allocated from:
#0 0x562fe591c1a3 in malloc (/home/runner/work/SHARK-Platform/SHARK-Platform/pyenv/versions/3.12.3/bin/python3.12+0xc71a3) (BuildId: 967a1d6b86eb9d5315fb009c5d4c54dd3a71a6cb)
#1 0x7f9854031cd8 in _PyObject_New /tmp/python-build.20240924192752.4394/Python-3.12.3/Objects/object.c:319:33
#2 0x7f983ba6f7f4 (/home/runner/work/SHARK-Platform/SHARK-Platform/pyenv/versions/3.12.3/lib/python3.12/site-packages/sentencepiece/_sentencepiece.cpython-312-x86_64-linux-gnu.so+0x6f7f4) (BuildId: 27ff3b63261f7a54f28902f49e11139b4d3b711f) This module had to be added for the tokenizer to work in the integration test: E ImportError:
E LlamaTokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
E installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
E that match your environment. Please note that you may need to restart your runtime after installation. The only test that I see which uses AutoTokenizer would be the CPU LLM integration test, which is outside of shortfin tests. The same tests pass ASan prior to sentencepiece being installed, so it's making me think that it's something under the hood at the build/installation layer which is causing the memory leaks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Mostly LGTM. A few comments about how this could continue to evolve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM once we decide on which triggers (pull_request
, push
, schedule
) are appropriate.
on: | ||
workflow_dispatch: | ||
schedule: | ||
# Weekdays at 13:00 UTC = 05:00 PST / 06:00 PDT. | ||
- cron: "5 4 * * 1-5" | ||
pull_request: | ||
push: | ||
branches: | ||
- main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@saienduri is nodai-amdgpu-mi250-x86-64
available enough for the pull_request
trigger here? Runners matching that label are currently busy running on-demand jobs in E2ESHARK Test Suite: https://github.com/nod-ai/SHARK-TestSuite/actions/runs/11673948346/job/32505667904
This job itself takes ~5 minutes today, but we shouldn't be queueing for 1h+.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah the job itself is a nightly, but maybe time to allocate a mi250 runner just for SHARK-Platform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good and makes sense as we scale these integration tests in the future. What needs to happen for us to enable that?
Add integration tests for llm server application on cpu
Moved `cpu_llm_server_test.py` to top level testing directory, Added `iree-llvmcpu-target-cpu=host` in `cpu_settings`, Skip if transformers is not installed, Add `available_port` fixture to find an available port instead of hardcoding, Use `/health` route to check for server availability
Add `ci-shark-and-shortfin.yml` workflow, Remove `llama_export_compile_serve.sh` script, Remove `test.yml` workflow file
Remove remove `exists` checks for non-cached files
Parameterize settings and batch_sizes
Add logging throughout test, Move integration test to `build_tools/integration_tests/llm`, Rename ci file
beb5757
to
00244e6
Compare
Something about this new test started failing overnight. The test might be flaky, or it might not be hermetic enough (e.g. by downloading dependencies that aren't pinned to specific versions). https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci-shark-platform.yml?query=branch%3Amain The workflow initially succeeded at this run/commit, but then failed on a retry: https://github.com/nod-ai/SHARK-Platform/actions/runs/11704581609 |
Yeah, for some reason the LLM Server started returning faulty responses:
Which caused an assertion failure. I'm seeing the exact same output for a failure here: https://github.com/nod-ai/SHARK-Platform/actions/runs/11706659706/job/32604263601?pr=435 The action is not pinned to a specific version for IREE, so my initial thought is that may be the culprit: pip install --no-compile -f https://iree.dev/pip-release-links.html --src deps \
-e "git+https://github.com/iree-org/iree-turbine.git#egg=iree-turbine" Others specify an installation candidate:
In a way this does highlight that there may be a bug between shortfin and latest IREE, but it shouldn't block development. I used the installation instructions from the README for installing IREE as opposed to a specific candidate to catch stuff like this, but that's kind of annoying from a developer perspective. I'm thinking we pin it to a specific version and we'll catch these kinds of issues when we bump IREE. |
shortfin/python/shortfin_apps/llm/components/service.py
to re-enable llm serving after recent changes